Abstract. The development of cancer is largely driven by the gain or loss of subsets
of the genome, promoting uncontrolled growth or disabling defenses against
it. Identifying genomic regions whose DNA copy number deviates from the nor- 
mal is therefore central to understanding cancer evolution. Array-based compar- 
ative genomic hybridization (aCGH) is a high-throughput technique for identi- 
fying DNA gain or loss by quantifying total amounts of DNA matching defined 
probes relative to healthy diploid control samples. Due to the high level of noise in 
microarray data, however, interpretation of aCGH output is a difficult and error- 
prone task. 

In this work, we tackle the computational task of inferring the DNA copy num- 
ber per genomic position from noisy aCGH data. We propose CGHTRIMMER, a 
novel segmentation method that uses a fast dynamic programming algorithm to 
solve for a least-squares objective function for copy number assignment. CGHTRIM- 
MER consistently achieves superior precision and recall to leading competitors on 
benchmarks of synthetic data and real data from the Coriell cell lines. In addition, 
it finds several novel markers not recorded in the benchmarks but plausibly supported
in the oncology literature. Furthermore, CGHTRIMMER achieves these
results with run times one to three orders of magnitude faster than its state-of-the-art
competitors.

CGHTRIMMER provides a new alternative for the problem of aCGH discretiza- 
tion that provides superior detection of fine-scale regions of gain or loss yet is fast 
enough to process very large data sets in seconds. It thus meets an important need 
for methods capable of handling the vast amounts of data being accumulated in 
high-throughput studies of tumor genetics. 

1 Introduction 

Tumorigenesis is a complex phenomenon often characterized by the successive acquisition
of combinations of genetic aberrations that result in malfunction or dysregulation
of genes. There are many forms of chromosome aberration that can contribute to
cancer development, including polyploidy, aneuploidy, interstitial deletion, reciprocal
translocation, non-reciprocal translocation, and amplification, the last of which itself
comes in several different types (e.g., double minutes, HSR and distributed insertions [1]).
Identifying the specific recurring aberrations, or sequences of aberrations, that characterize
particular cancers provides important clues about the genetic basis of tumor
development and possible targets for diagnostics or therapeutics. Many other genetic
diseases are also characterized by gain or loss of genetic regions, such as Down Syndrome
(trisomy 21) [24], Cri du Chat (5p deletion) [25], and Prader-Willi syndrome
(deletion of 15q11-13) [5], and recent evidence has begun to suggest that inherited copy
number variations are far more common and more important to human health than had
been suspected just a few years ago [45]. These facts have created a need for methods
for assessing DNA copy number variations in individual organisms or tissues.

In this work, we focus specifically on array-based comparative genomic hybridization
(aCGH) [4,35,18,34], a method for copy number assessment using DNA microarrays
that remains, for the moment, the leading approach for high-throughput typing of
copy number abnormalities. The technique of aCGH is schematically represented in
Figure 1. A test and a reference DNA sample are differentially labeled and hybridized
to a microarray, and the ratio of their fluorescence intensities is measured for each spot.
A typical output of this process is shown in Figure 1 (3), where the genomic profile of
the cell line GM05296 [39] is shown for each chromosome. The x-axis corresponds to
genomic position and the y-axis corresponds to a noisy measurement of the ratio
$\log_2 \frac{T}{R}$ for each genomic position. Here R is the copy number of the reference
sample (R=2 for healthy diploid organisms) and T is the DNA copy number we want to
infer from the noisy measurements. For more details on the use of
aCGH to detect different types of chromosomal aberrations, see f\\. 
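
As a small illustration of the measured quantity, the following sketch computes the ideal, noise-free value of log2(T/R) for a few copy numbers and inverts it again. The function names are hypothetical and chosen here for illustration; real aCGH values are noisy and normalized, so this only shows the relationship between copy number and log ratio.

```python
import math


def log2_ratio(test_copies: float, reference_copies: float = 2.0) -> float:
    """Ideal (noise-free) aCGH value log2(T/R) for one probe."""
    return math.log2(test_copies / reference_copies)


def copies_from_log2_ratio(value: float, reference_copies: float = 2.0) -> float:
    """Invert log2(T/R) to recover the test copy number T."""
    return reference_copies * 2.0 ** value


# Single-copy loss, normal, single-copy gain, two-copy gain (diploid reference).
for t in (1, 2, 3, 4):
    print(t, round(log2_ratio(t), 3))
```

A single-copy gain against a diploid reference therefore sits at roughly 0.585 on the log2 scale, while a single-copy loss sits at -1.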

Converting raw aCGH log fluorescence ratios into discrete DNA copy numbers
is an important but non-trivial problem in using aCGH to study cancer progression.
Finding DNA regions that consistently exhibit chromosomal losses or gains in cancers
provides a crucial means for locating the specific genes involved in development of different
cancer types. It is therefore important to distinguish, when a probe shows unusually
high or low fluorescence, whether that aberrant signal reflects experimental noise
or a probe that is truly found in a segment of DNA that is gained or lost. Furthermore,
successful discretization of array CGH data is crucial for understanding the process of
cancer evolution, since discrete inputs are required for a large family of successful evolution
algorithms, e.g., [8,9]. It is worth noting that manual annotation of such regions,
even if possible [39], is tedious and prone to mistakes due to several sources of noise
(impurity of test sample, noise from array CGH method, etc.).

Many algorithms and objective functions have thus been proposed for the problem
of discretizing and segmenting aCGH data. Many methods, starting with Fridlyand et
al. [11], treat aCGH segmentation as a hidden Markov model (HMM) inference problem.
The HMM approach has since been extended in various ways, such as through
the use of Bayesian HMMs [12], incorporation of prior knowledge of locations of
DNA copy number polymorphisms [37], and the use of Kalman filters [38]. Other
approaches include wavelet decompositions [14], quantile regression [10],
expectation-maximization in combination with edge-filtering [27], genetic algorithms [17],
clustering-based methods [44,42], variants on Lasso regression [40,15] and various
problem-specific Bayesian [2], likelihood [16], and other statistical models [26]. A dynamic
programming approach, in combination with expectation maximization, has been
Fig. 1. Schematic representation of array CGH. Genomic DNA from two cell populations
(1) is differentially labeled and hybridized in a microarray (2). Typically the
reference DNA comes from a normal subject. For humans this means that the reference
DNA comes from a normal diploid genome. The ratios on each spot are measured and
normalised so that the median log2 ratio is zero. The final result is an ordered tuple containing
values of the fluorescent ratios in each genomic position per each chromosome.
This is shown in (3) where we see the genomic profile of the cell line GM05296 [39].
The problem of denoising array CGH data is to infer the true DNA copy number T per
genomic position from a set of noisy measurements of the quantity $\log_2 \frac{T}{R}$,
where R=2 for normal diploid humans.



previously used by Picard et al. [33]. Lai et al. [21] and Willenbrock et al. [43] have
conducted extensive experimental analysis of the range of available methods, with two
in particular standing out as the leading approaches in practice. One of these top methods
is CGHSEG [32], which assumes that a given CGH profile is a Gaussian process
whose distribution parameters are affected by abrupt changes at unknown
coordinates/breakpoints. The other is Circular Binary Segmentation [30] (CBS), a modification
of binary segmentation, originally proposed by Sen and Srivastava [36], which
uses a statistical comparison of mean expressions of adjacent windows of nearby probes
to identify possible breakpoints between segments combined with a greedy algorithm
to locally optimize breakpoint positions.

Our main contribution in this work is a new algorithm, CGHtrimmer, for denois- 
ing and segmentation of aCGH data. We develop a novel objective function for the 
problem based on least-squares minimization of errors combined with a regularization 
parameter to favor contiguity across segments. We show how to solve efficiently for 
this objective function through dynamic programming. We then validate the method, in 
comparison to the leading CBS and CGHSEG methods, on a combination of synthetic 
and real benchmarks. Finally, we show that CGHtrimmer yields superior accuracy in 
identifying known breakpoints while performing one to three orders of magnitude faster 
than the comparative methods. The remainder of the paper is organized as follows: Section 2
presents our proposed method and its theoretical analysis. Section 3 describes the
experimental setup and shows the experimental results of our method compared to two



Symbol        Description
$p_i$         measurement of $\log_2 \frac{T}{R}$ in the i-th probe
$n$           number of probes
$T$           DNA copy number
$R$           reference value, equal to 2 for diploid organisms
$K$           number of segments to be fitted
$\lambda$     regularization parameter
$m_{(j)}$     average of points $\{p_1, \ldots, p_j\}$, $m_{(j)} = \sum_{i=1}^{j} p_i / j$
$SE_{(j)}$    sum of squared errors $\sum_{i=1}^{j} (p_i - m_{(j)})^2$
CBS           Circular binary segmentation [30]
CGHSEG        Picard et al. [32]

Table 1. Symbols and abbreviations used.



state-of-the-art methods [30,32] on both synthetic and real aCGH data. Section 4 concludes
the paper with a brief summary and discussion.



2 Proposed Method 

We formulate the problem of denoising aCGH data as a least squares problem with
a penalty/regularization term for the number of segments. Table 1 shows the symbols
used throughout the paper for convenience. Let $P = \{p_1, \ldots, p_n\}$ be a set of n points,
$p_i \in \mathbb{R}$. Our goal is to find a sequence of piecewise constant segments which minimize
the sum of squared errors per segment and the number of segments K. Ideally, our
algorithm will fit the data with piecewise constant segments corresponding to the true
DNA copy number at each genomic position. The intuition for penalizing the number
of segments is that we expect a strong spatial coherence between nearby probes. Gains
and losses are likely to occur for DNA segments covering many probes, and adjacent
probes should therefore usually have the same copy number. We avoid making explicit
assumptions about the statistical nature of the signal (although we note that the least
squares method can be interpreted as a maximization of the likelihood function under
the assumption that noise derives from identically distributed Gaussian random variables)
since we do not believe we have a strong empirical basis for exactly capturing
the true correlation structure of aCGH data. To solve this optimization problem, we
define the key quantity $\mathrm{OPT}_i$, given by the following equation:



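$$\mathrm{OPT}_i = \begin{cases} 0 & \text{if } i = 0 \\ \min_{1 \le j \le i} \left[ \mathrm{OPT}_{j-1} + \sum_{k=j}^{i} \left( p_k - \frac{\sum_{m=j}^{i} p_m}{i-j+1} \right)^2 \right] + \lambda & \text{if } i > 0 \end{cases} \qquad (1)$$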

Equation 1 has a straightforward interpretation: $\mathrm{OPT}_i$ is the minimum cost of
fitting a set of piecewise constant segments to points $p_1$ through $p_i$. For each candidate
index j, the first term $\mathrm{OPT}_{j-1}$ is the optimal cost of fitting points $p_1$ through
$p_{j-1}$, given that the last change in copy number occurred between points $p_{j-1}$ and
$p_j$, and each fitted segment incurs a fixed cost $\lambda$. The second term is the minimum
squared error for fitting a constant segment on points $\{p_j, \ldots, p_i\}$, which is obtained
for the constant segment with value equal to the average intensity of the points in the
segment, i.e., $\frac{\sum_{m=j}^{i} p_m}{i-j+1}$.
Equation 1 directly implies a dynamic programming algorithm, the CGHtrimmer
algorithm presented as Algorithm 1. For each point $p_i$, we find a point $p_j$ such that j
is the minimum index over all points before i such that points $p_j$ through $p_i$ belong
to the same segment. Since aCGH data are given in the log scale, we first exponentiate
the points, then fit the constant segment by taking the average of the exponentiated
values from the hypothesized segment, and then return to the log domain by taking
the logarithm of that constant value. Observe that one can fit a constant segment by
averaging the log values using Jensen's inequality, but we favor an approach more consistent
with the prior work, which typically models the data assuming i.i.d. Gaussian
noise in the linear domain. The algorithm decides which points will be assigned to the
same segment by tracing back, starting from the last point $p_n$, using the breakpoint variables
until it assigns the first point $p_1$ to a segment. The main computational bottleneck
of CGHtrimmer is the computation of an auxiliary matrix M, an upper triangular
matrix for which $m_{ij}$ is the minimum squared error of fitting a segment from points
$\{p_i, \ldots, p_j\}$. To avoid a naive algorithm that would simply find the average of those
points and then compute the squared error, resulting in $O(n^3)$ time, we take advantage
of the following theorem:



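Algorithm 1 CGHtrimmer algorithm.

Require: Points $P = \{p_1, \ldots, p_n\}$
Require: Regularization parameter $\lambda$
Compute an $n \times n$ matrix $M$, where $M_{j,i} = \sum_{k=j}^{i} \left( p_k - \frac{\sum_{m=j}^{i} p_m}{i-j+1} \right)^2$
$\mathrm{OPT}_0 \leftarrow 0$
for $i = 1$ to $n$ do
    $\mathrm{OPT}_i \leftarrow \min_{1 \le j \le i} \mathrm{OPT}_{j-1} + M_{j,i} + \lambda$
    $\mathrm{BREAK}_i \leftarrow \arg\min_{1 \le j \le i} \mathrm{OPT}_{j-1} + M_{j,i} + \lambda$
end for
$tmp \leftarrow n$ {Assign points to segments}
while $tmp \neq 0$ do
    Assign points $\{p_{\mathrm{BREAK}_{tmp}}, \ldots, p_{tmp}\}$ to one segment
    $tmp \leftarrow \mathrm{BREAK}_{tmp} - 1$
end while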
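
The dynamic program of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the function name `segment` and the argument `lam` (the per-segment penalty) are hypothetical, segment costs are obtained from prefix sums rather than from a stored matrix M, and the constant is fitted by averaging the given values directly, omitting the exponentiate-then-log step described in the text.

```python
from typing import List, Tuple


def segment(points: List[float], lam: float) -> List[Tuple[int, int]]:
    """Return segments as inclusive 0-based (start, end) index pairs."""
    n = len(points)
    # Prefix sums of values and squares give any segment's squared error in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for k, p in enumerate(points, start=1):
        s1[k] = s1[k - 1] + p
        s2[k] = s2[k - 1] + p * p

    def sq_error(j: int, i: int) -> float:
        """Squared error of fitting the mean to points j..i (1-based, inclusive)."""
        total = s1[i] - s1[j - 1]
        return (s2[i] - s2[j - 1]) - total * total / (i - j + 1)

    opt = [0.0] * (n + 1)  # opt[0] = 0 is the base case of the recursion
    brk = [0] * (n + 1)    # brk[i] = first index of the last segment ending at i
    for i in range(1, n + 1):
        best_j = min(range(1, i + 1), key=lambda j: opt[j - 1] + sq_error(j, i))
        opt[i] = opt[best_j - 1] + sq_error(best_j, i) + lam
        brk[i] = best_j

    # Trace back from the last point to recover the segments.
    segments = []
    tmp = n
    while tmp != 0:
        segments.append((brk[tmp] - 1, tmp - 1))
        tmp = brk[tmp] - 1
    return segments[::-1]


data = [0.0, 0.1, -0.1, 0.0, 1.0, 1.1, 0.9, 1.0]
print(segment(data, lam=0.2))  # [(0, 3), (4, 7)]
```

On this toy input, two segments cost 0.04 in squared error plus 0.4 in penalties, which beats a single segment (squared error 2.04 plus 0.2), so the step between the fourth and fifth points is recovered.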



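Theorem 1. Let $m_{(j)}$ and $SE_{(j)}$ be the average and the minimum squared error of fitting
a constant segment for points $\{p_1, \ldots, p_j\}$. Then the following equations hold:

$$m_{(j)} = \frac{j-1}{j} m_{(j-1)} + \frac{1}{j} p_j \qquad (2)$$

$$SE_{(j)} = SE_{(j-1)} + \frac{j-1}{j} \left( p_j - m_{(j-1)} \right)^2 \qquad (3)$$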
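
As a quick sanity check of Theorem 1, the following worked example applies the two online updates (running average and running squared error, written here as $m_{(j)}$ and $SE_{(j)}$) to three illustrative points $p = (1, 2, 4)$. The points are chosen arbitrarily for illustration and the result is compared with the direct computation.

```latex
% j = 1: a single point has zero squared error.
m_{(1)} = 1, \qquad SE_{(1)} = 0
% j = 2: update with p_2 = 2.
m_{(2)} = \tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 2 = 1.5, \qquad
SE_{(2)} = 0 + \tfrac{1}{2}\,(2 - 1)^2 = 0.5
% j = 3: update with p_3 = 4.
m_{(3)} = \tfrac{2}{3}\cdot 1.5 + \tfrac{1}{3}\cdot 4 = \tfrac{7}{3}, \qquad
SE_{(3)} = 0.5 + \tfrac{2}{3}\,(4 - 1.5)^2 = \tfrac{14}{3}
% Direct check: (1 - 7/3)^2 + (2 - 7/3)^2 + (4 - 7/3)^2 = 16/9 + 1/9 + 25/9 = 14/3.
```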



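Algorithm 2 Computing matrix M efficiently in $O(n^2)$

Require: Points $P = \{p_1, \ldots, p_n\}$
Initialize matrix $A \in \mathbb{R}^{n \times n}$, $A_{ij} = 0$ for $i \neq j$ and $A_{ii} = p_i$.
Initialize matrix $M \in \mathbb{R}^{n \times n}$ with zeros.
for $i = 1$ to $n$ do
    for $j = i + 1$ to $n$ do
        $A_{i,j} \leftarrow \frac{j-i}{j-i+1} A_{i,j-1} + \frac{1}{j-i+1} p_j$
    end for
end for
for $i = 1$ to $n$ do
    for $j = i + 1$ to $n$ do
        $M_{i,j} \leftarrow M_{i,j-1} + \frac{j-i}{j-i+1} \left( p_j - A_{i,j-1} \right)^2$
    end for
end for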
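
A minimal Python sketch of Algorithm 2 follows, assuming 0-based indexing and plain nested lists; the function name `squared_error_matrix` is hypothetical. It merges the two loops of the pseudocode into one pass, which is equivalent because each M entry only needs the previous A entry in the same row.

```python
from typing import List


def squared_error_matrix(points: List[float]) -> List[List[float]]:
    """M[i][j] = squared error of fitting one mean to points i..j (0-based, i <= j)."""
    n = len(points)
    A = [[0.0] * n for _ in range(n)]  # A[i][j]: running mean of points i..j
    M = [[0.0] * n for _ in range(n)]  # entries below the diagonal stay zero
    for i in range(n):
        A[i][i] = points[i]
        for j in range(i + 1, n):
            length = j - i + 1
            weight = (length - 1) / length
            # Online squared-error update uses the mean of points i..j-1.
            M[i][j] = M[i][j - 1] + weight * (points[j] - A[i][j - 1]) ** 2
            # Online mean update.
            A[i][j] = weight * A[i][j - 1] + points[j] / length
    return M


M = squared_error_matrix([1.0, 2.0, 4.0])
print(M[0][1], M[1][2], round(M[0][2], 4))
```

Each entry is filled in constant time, so the whole matrix takes time and memory quadratic in the number of probes.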



For a proof, see [20]. Equations 2 and 3 provide us a way to compute means and
least squared errors online, leading to the $O(n^2)$ Algorithm 2 for computing matrix M.

The resulting method has $O(n^2)$ time and space complexity. The algorithm needs to
store two $n \times n$ matrices, M and A, in addition to the size-n OPT and BREAK arrays
needed for the dynamic programming. Therefore the space complexity is $O(n^2)$.
The A and M matrices each have $O(n^2)$ entries and require $O(1)$ time to compute each
entry, while computing the $\mathrm{OPT}_i$ and $\mathrm{BREAK}_i$ entries requires at most $O(n)$ time for
each of n entries. Therefore the total running time is $O(n^2)$.



3 Results 
[Table 2: benchmark datasets. Only the end of the caption survives: markers denote which datasets are synthetic and real, respectively.]



This section is organized as follows: first we describe the experimental setup and 
how we trained our model. Then we present our experimental results on both synthetic 
and real data. 



3.1 Experimental Setup and Datasets 



CGHtrimmer is implemented in MATLAB. The experiments were run on a machine with
4GB RAM and a 2.4GHz Intel(R) Core(TM)2 Duo CPU, running Windows Vista. Our method was
compared to existing MATLAB implementations of the CBS algorithm, available via
the Bioinformatics toolbox, and the CGHSEG algorithm, generously provided by
Franc Picard [32]. CGHSEG was run using the heteroscedastic model under the Lavielle
criterion [22]. Additional tests using the homoscedastic model showed substantially
worse performance and are omitted here. All methods were compared using previously
developed benchmark datasets, shown in Table 2. Follow-up analysis of detected
regions was conducted by manually searching for significant genes in the Genes-to-Systems
Breast Cancer Database (http://www.itb.cnr.it/breastcancer) [41]
and validating their positions with the UCSC Genome Browser (http://genome.ucsc.edu/).
The Atlas of Genetics and Cytogenetics in Oncology and Haematology
(http://atlasgeneticsoncology.org/) was also used to validate the significance
of reported cancer-associated genes.




Fig. 2. ROC curve of CGHtrimmer as a function of λ on data from [43]. The red arrow
indicates the point (0.91 recall and 0.98 precision) corresponding to
λ=0.2, the value used in all subsequent results.



3.2 Picking λ

The performance of our algorithm depends on the value of the parameter λ, which determines
how much each segment "costs." Clearly, there is a tradeoff between bigger and
smaller values: excessively large λ will lead the algorithm to output a single segment
while excessively small λ will result in each point being fit as its own segment. We pick
our parameter λ using data published in [43]. These data have been generated by modeling
real CGH data, thus capturing their nature better than other simplified synthetic
data and also making them a good training dataset for our model. We used one of the
four datasets in [43] ("above 20") to generate a Receiver Operating Characteristic (ROC)
curve using values for λ ranging from 0 to 4 with increment 0.01. The
resulting curve is shown in Figure 2. We then selected λ = 0.2, which achieves high
precision (0.98) and high recall (0.91). All subsequent results reported were obtained
by setting λ equal to 0.2.
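
To get a feel for the scale of the regularization parameter λ (the per-segment cost of Section 2), the following back-of-the-envelope calculation asks when the objective prefers to carve an aberration out of a flat baseline rather than fit a single segment. It is an idealized, noise-free sketch and not part of the original analysis; the symbols w (aberration width in probes), h (its log-ratio height) and N (total number of probes) are introduced here for illustration only.

```latex
% One segment: all N probes share the mean wh/N.
\mathrm{cost}_1 = \frac{w(N-w)}{N}\,h^2 + \lambda
% Three segments (baseline, aberration, baseline): zero squared error.
\mathrm{cost}_3 = 3\lambda
% The aberration is segmented out when cost_3 < cost_1:
\frac{w(N-w)}{N}\,h^2 > 2\lambda
% For N \gg w, a single-copy gain h = \log_2(3/2) \approx 0.585, and \lambda = 0.2:
0.342\,w > 0.4 \iff w \ge 2
```

Under these assumptions, λ = 0.2 is small enough to keep a noise-free single-copy gain spanning two probes but not one spanning a single probe. With real noise the effective threshold shifts, which is why the value is tuned empirically.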




(a) CGHtrimmer 



(b) CBS 



(c) CGHSEG 



Fig. 3. Performance of CGHtrimmer, CBS, and CGHseg on denoising synthetic
aCGH data from [21]. CGHtrimmer and CGHseg exhibit excellent precision and
recall whereas CBS misses two consecutive genomic positions with DNA copy number
equal to 3.



3.3 Synthetic Data 

We use the synthetic data published in [21]. The data consist of five aberrations of
increasing widths of 2, 5, 10, 20 and 40 probes, respectively, with Gaussian noise
$N(0, 0.25^2)$. Figure 3 shows the performance of CGHtrimmer, CBS, and CGHseg.
Both CGHtrimmer and CGHseg correctly detect all aberrations, while CBS misses
the first, smallest region. Run time for CGHtrimmer is 0.007 sec, compared to 1.23
sec for CGHseg and 60 sec for CBS.
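
For readers who want to experiment with data of this shape, the sketch below generates a profile with five aberrations of widths 2, 5, 10, 20 and 40 probes and additive Gaussian noise with standard deviation 0.25. It is not the generator used in [21]: the total number of probes, the aberration height, the even spacing of the aberrations and the function name `synthetic_profile` are all assumptions made here for illustration.

```python
import random
from typing import List, Sequence, Tuple


def synthetic_profile(
    n_probes: int = 500,
    widths: Sequence[int] = (2, 5, 10, 20, 40),
    height: float = 1.0,   # assumed log-ratio level of every aberration
    sigma: float = 0.25,   # noise standard deviation, as in N(0, 0.25^2)
    seed: int = 0,
) -> Tuple[List[float], List[float]]:
    """Return (noise-free truth, noisy measurements) for one synthetic profile."""
    rng = random.Random(seed)
    truth = [0.0] * n_probes
    # Space the aberrations evenly; defaults keep every aberration inside the array.
    gap = n_probes // (len(widths) + 1)
    for k, width in enumerate(widths, start=1):
        start = k * gap
        for idx in range(start, start + width):
            truth[idx] = height
    noisy = [t + rng.gauss(0.0, sigma) for t in truth]
    return truth, noisy


truth, noisy = synthetic_profile()
print(len(noisy), sum(1 for t in truth if t != 0.0))
```

With the defaults, 77 of the 500 probes lie inside an aberration, and fixing the seed makes the profile reproducible.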

3.4 Coriell Cell Lines 

The first real dataset we use to evaluate our method is the Coriell cell line BAC array
CGH data [39], which is widely considered a "gold standard" dataset. The dataset is
derived from 15 fibroblast cell lines using the normalized average of log2 fluorescence
relative to a diploid reference. To call gains or losses of inferred segments, we assign
each segment the mean intensity of its probes and then apply a simple threshold test
to determine if the mean is abnormal. We follow [3] in favoring ±0.3 out of the wide
variety of thresholds that have been used [29].
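
The threshold test described above can be sketched as follows. The function name `call_segment` and the string labels are hypothetical, and treating a mean exactly equal to the threshold as normal is an assumption, since the text does not specify how ties are handled.

```python
from typing import Sequence


def call_segment(probe_values: Sequence[float], threshold: float = 0.3) -> str:
    """Label a segment by comparing the mean of its probes with +/- threshold."""
    mean = sum(probe_values) / len(probe_values)
    if mean > threshold:
        return "gain"
    if mean < -threshold:
        return "loss"
    return "normal"


print(call_segment([0.52, 0.61, 0.48]))    # gain
print(call_segment([-0.95, -1.10]))        # loss
print(call_segment([0.05, -0.10, 0.02]))   # normal
```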

Table 3 summarizes the performance of CGHtrimmer, CBS and CGHseg relative
to previously annotated gains and losses in the Coriell dataset. The table shows
notably better performance for CGHtrimmer compared to either alternative method.
CGHtrimmer finds 22 of 23 expected segments with one false positive. CBS finds
20 of 23 expected segments with one false positive. CGHseg finds 22 of 23 expected
segments with seven false positives. CGHtrimmer thus matches the recall of
CGHseg with far fewer false positives, and matches the false positive count of CBS
with higher recall. In cell line GM03563, CBS fails to detect a region of two
points which have undergone a loss along chromosome 9, in accordance with the results
obtained using the Lai et al. [21] synthetic data. In cell line GM03134, CGHSEG reports
a false positive along chromosome 1 which both CGHtrimmer and CBS avoid. In cell
line GM01535, CGHseg reports a false positive along chromosome 8 and CBS misses
the aberration along chromosome 12. CGHtrimmer, however, performs ideally on
this cell line. In cell line GM02948, CGHtrimmer reports a false positive along chromosome
7, finding a one-point segment in 7q21.3d at genomic position 97000 whose
value is equal to 0.732726. All other methods also make false positive errors on this cell
line. In GM07081, all three methods fail to find an annotated aberration on chromosome
15. In addition, CGHSEG finds a false positive on chromosome 11.
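
The segment counts quoted above translate into precision and recall as sketched below. Treating each detected expected segment as a true positive and each unannotated detection as a false positive is an assumption made here for illustration; the helper name `precision_recall` is hypothetical.

```python
from typing import Tuple


def precision_recall(
    true_positives: int, false_negatives: int, false_positives: int
) -> Tuple[float, float]:
    """Segment-level precision and recall from raw counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# (found, missed, false positives) out of 23 annotated segments, from the text.
counts = {
    "CGHtrimmer": (22, 1, 1),
    "CBS": (20, 3, 1),
    "CGHseg": (22, 1, 7),
}
for method, (tp, fn, fp) in counts.items():
    p, r = precision_recall(tp, fn, fp)
    print(f"{method}: precision={p:.3f} recall={r:.3f}")
```

Under this counting, CGHtrimmer and CBS have nearly equal precision (22/23 versus 20/21), while CGHseg's seven false positives pull its precision down to 22/29.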

CGHtrimmer also substantially outperforms the comparative methods in run time, 
requiring 5.78 sec for the full data set versus 8.15 min for CGHSEG (an 84.6-fold 
speedup) and 47.7 min for CBS (a 495-fold speedup). 

3.5 Breast Cancer Cell Lines 

To further illustrate the performance of CGHtrimmer and compare it to CBS and
CGHseg, we applied it to the Berkeley Breast Cancer cell line database [28]. The
dataset consists of 53 breast cancer cell lines that capture most of the recurrent genomic
and transcriptional characteristics of 145 primary breast cancer cases. We do
not have an accepted "answer key" for this data set, but it provides a more extensive
basis for detailed comparison of differences in performance of the methods on common
data sets, as well as an opportunity for novel discovery. While we have applied
the methods to all chromosomes in all cell lines, space limitations prevent us from
presenting the full results here. We therefore arbitrarily selected three of the 53 cell lines
Table 3. Results from applying CGHtrimmer, CBS, and CGHseg to 15 cell lines.
Rows with listed chromosome numbers (e.g., GM03563/3) correspond to known
gains or losses and are annotated with a check mark if the expected gain or loss was
detected or a "No" if it was not. Additional rows list chromosomes on which segments
not annotated in the benchmark were detected; we presume these to be false positives.



and selected three chromosomes per cell line that we believed would best illustrate
the comparative performance of the methods. The Genes-to-Systems Breast Cancer
Database (http://www.itb.cnr.it/breastcancer) [41] was used to identify
known breast cancer markers in regions predicted to be gained or lost by one or more of
the methods, with the UCSC Genome Browser (http://genome.ucsc.edu/) used
to verify the placement of genes.

We note that CGHtrimmer again had a substantial advantage in run time. For the 
full data set, CGHtrimmer required 22.76 sec, compared to 23.3 min for CGHseg (a 
61.5-fold increase), and 4.95 hrs for CBS (a 783-fold increase). 



Cell Line BT474: Figure 4 shows the performance of each method on the BT474
cell line. The three methods report different results for chromosome 1, as shown in
Figures 4(a,b,c), with all three detecting amplification in the q-arm but differing in the detail
of resolution. CGHtrimmer is the only method that detects region 1q31.2-1q31.3 as
aberrant. This region hosts gene NEK7, a candidate oncogene [19], and gene KIF14,
a predictor of grade and outcome in breast cancer [7]. CGHtrimmer and CBS annotate
the region 1q23.3-1q24.3 as amplified. This region hosts several genes previously
implicated in breast cancers [31], such as CREG1 (1q24), POU2F1 (1q22-23), RCSD1
(1q22-q24), and BLZF1 (1q24). Finally, CGHtrimmer alone reports independent
amplification of the gene CHRM3, a marker of metastasis in breast cancer patients [41].





Fig. 4. Visualization of segmentation output of CGHtrimmer, CBS, and CGHseg for
cell line BT474 on chromosomes 1 (a,b,c), 5 (d,e,f), and 17 (g,h,i). (a,d,g) CGHtrimmer
output. (b,e,h) CBS output. (c,f,i) CGHSEG output.



For chromosome 5 (Figures 4(d,e,f)), the behavior of the three methods is almost
identical, with all three reporting amplification of a region known to contain many
breast cancer markers, including MRPL36 (5p33), ADAMTS16 (5p15.32), POLS (5p15.31),
ADCY2 (5p15.31), CCT5 (5p15.2), TAS2R1 (5p15.31), ROPN1L (5p15.2), DAP (5p15.2),
ANKH (5p15.2), FBXL7 (5p15.1), BASP1 (5p15.1), CDH18 (5p14.3), CDH12 (5p14.3),
CDH10 (5p14.2 - 5p14.1), CDH9 (5p14.1), PDZD2 (5p13.3), GOLPH3 (5p13.3), MTMR12
(5p13.3), ADAMTS12 (5p13.3 - 5p13.2), SLC45A2 (5p13.2), TARS (5p13.3), RAD1
(5p13.2), AGXT2 (5p13.2), SKP2 (5p13.2), NIPBL (5p13.2), NUP155 (5p13.2), KRT18P31
(5p13.2), LIFR (5p13.1) and GDNF (5p13.2) [41]. The only difference in the assignments
is that CBS fits one more probe to this amplified segment.

Finally, for chromosome 17 (Figures 4(g,h,i)), like chromosome 1, all three detect
amplification but CGHtrimmer predicts a finer breakdown of the amplified region
into independently amplified segments. All three detect amplification of a region including
the major breast cancer biomarkers HER2 (17q21.1) and BRCA1 (17q21) and
the additional markers MSI2 (17q23.2) and TRIM37 (17q23.2) [41]. While the more
discontiguous picture produced by CGHtrimmer may appear to be a less parsimonious
explanation of the data, a complex combination of fine-scale gains and losses in
17q is in fact well supported by the literature [31].




Fig. 5. Visualization of segmentation output of CGHtrimmer, CBS, and CGHseg
for cell line HS578T on chromosomes 3 (a,b,c), 11 (d,e,f), and 17 (g,h,i). (a,d,g)
CGHtrimmer output. (b,e,h) CBS output. (c,f,i) CGHSEG output.



Cell Line HS578T: Figure 5 compares the methods on cell line HS578T for chromosomes
3, 11 and 17. Chromosome 3 (Figures 5(a,b,c)) shows identical prediction
of an amplification of 3q24-3qter for all three methods, a region including the key
breast cancer marker PIK3CA (3q26.32) [23], and additional breast-cancer-associated
genes TIG1 (3q25.32), MME (3q25.2), TNFSF10 (3q26), MUC4 (3q29), TFRC (3q29),
DLG1 (3q29) [41]. CGHtrimmer and CGHseg also have identical predictions of
normal copy number in the p-arm, while CBS reports an additional loss between 3p21
and 3p14.3. We are unaware of any known gain or loss in this region associated with
breast cancer.

For chromosome 11 (Figures 5(d,e,f)), the methods again present an identical picture
of loss at the q-terminus (11q24.2-11qter) but detect amplifications of the p-arm
at different levels of resolution. CGHtrimmer and CBS detect gain in the region
11p15.5, the site of the HRAS breast cancer metastasis marker [41]. In contrast to CBS,
CGHtrimmer detects an adjacent loss region. While we have no direct evidence this
loss is a true finding, the region of predicted loss does contain EIF3F (11p15.4), identified
as a possible tumor suppressor whose expression is decreased in most pancreatic
cancers and melanomas [41]. We can thus conjecture that EIF3F is also a tumor suppressor
in breast cancers.




Fig. 6. Visualization of segmentation output of CGHtrimmer, CBS, and CGHseg for
cell line T47D on chromosomes 1 (a,b,c), 11 (d,e,f), and 20 (g,h,i). (a,d,g) CGHtrimmer
output. (b,e,h) CBS output. (c,f,i) CGHSEG output.



On chromosome 17 (Figures 5(g,h,i)), the three methods behave similarly, with all
three predicting amplification of the p-arm. CBS places one more marker in the amplified
region, causing it to cross the centromere, while CGHSEG breaks the amplified
region into three segments by predicting additional amplification at a single marker.

Cell Line T47D: Figure 6 compares the methods on chromosomes 1, 11, and 20 of cell
line T47D. On chromosome 1 (Figure 6(a,b,c)), all three methods detect loss of the
p-arm and predominant amplification of the q-arm. CBS infers a presumably spurious
extension of the p-arm loss across the centromere into the q-arm, while the other methods
do not. The main differences between the three methods appear on the q-arm of
chromosome 1. CGHtrimmer and CGHSEG both detect a small region of gain proximal
to the centromere at 1q21.1-1q21.2, followed by a short region of loss spanning
1q21.3-1q22. CBS merges these into a single longer region of normal copy number.
The existence of a small region of loss at this location in breast cancers is supported by
prior literature (Sj. 

The three methods provide comparable segmentations of chromosome 11
(Figure 6(d,e,f)). All predict loss near the p-terminus, a long segment of amplification
stretching across much of the p- and q-arms, and additional amplification near the
q-terminus. CGHtrimmer, however, breaks this q-terminal amplification into several
sub-segments at different levels of amplification while CBS and CGHSEG both fit a
single segment to that region. We have no empirical basis for determining which segmentation
is correct here. CGHtrimmer does appear to provide a spurious break in
the long amplified segment that is not predicted by the others.

Finally, along chromosome 20 (Figure 6(g,h,i)), the output of the methods is similar,
with all three methods suggesting that the q-arm has an aberrant copy number, an observation
consistent with prior studies [13]. The only exception is again that CBS fits one
point more than the other two methods along the first segment, causing a likely spurious
extension of the p-arm's normal copy number into the q-arm.

4 Conclusions 

We have presented CGHtrimmer, a new algorithm for detecting genomic regions of
loss or gain in aCGH data. We compared CGHtrimmer to two widely used methods,
CBS [30] and CGHSEG [32], that have previously been identified as the most successful
among many options in the literature [21]. CGHtrimmer shows performance identical
to CGHSEG and superior to CBS on a synthetic benchmark and superior performance
to both on a benchmark of real cell line data. Further demonstration of the methods on
selected regions from a large breast cancer cell line dataset suggests that CGHtrimmer
is generally superior at detecting fine-scale variations in aCGH data, while avoiding
apparently spurious or misplaced breakpoints assigned by the other methods. Where
results differ between the methods, there is usually good support in the literature for
the CGHtrimmer segmentation. Furthermore, CGHtrimmer achieves superior accuracy
with run times more than 50-fold faster than CGHSEG and roughly 500-fold or more
faster than CBS.